[00:03.850 --> 00:10.030]  Hello everyone, my name is Apoorva Singh Gautam. I would like to thank DEF CON Red Team Village
[00:10.030 --> 00:16.610]  for having me here. I will talk about automating threat hunting on the dark web and the things that
[00:16.610 --> 00:24.410]  surround it. I presented a short version of this talk at GRIMMCon and this is
[00:24.410 --> 00:31.150]  the extended version of it. So let's get started. I will switch off my webcam so that you guys can
[00:31.150 --> 00:40.620]  focus on the presentation here. Okay, so a little about me. My name is Apoorva Singh Gautam. I go
[00:40.620 --> 00:46.220]  by the handle ASGScorpion. I am a security researcher. I got started in threat intel or
[00:46.220 --> 00:53.900]  hunting two years back and I've been loving it since then. I'm currently pursuing my master's
[00:53.900 --> 01:00.140]  in cyber security at Georgia Tech. Recently, during the summer, I was a research intern at
[01:00.140 --> 01:09.380]  UC Berkeley doing research in threat intelligence. Some of my hobbies are gaming. I pretty much
[01:09.380 --> 01:16.840]  love Rainbow Six Siege. I sometimes stream it also. I love hiking. I recently got
[01:16.840 --> 01:23.300]  into lockpicking and I've been enjoying it. Contributing to the security community
[01:23.300 --> 01:28.740]  is my passion. I'm a senior teaching
[01:28.740 --> 01:34.600]  assistant at Cybrary. I also contribute at StationX and at local security meetups.
[01:35.040 --> 01:40.380]  And there are my socials if you want to contact me or hit me up.
[01:42.100 --> 01:48.600]  So what's today's agenda? So we will talk about introduction to dark web, what dark web is,
[01:48.600 --> 01:54.700]  how to access dark web, what Tor is, what is the difference between dark web, deep web,
[01:54.700 --> 01:58.720]  why you should perform hunting on the dark web. Before that, we will discuss
[01:58.720 --> 02:03.620]  what is threat hunting and why it is crucial to hunt on the dark web. We will discuss
[02:04.320 --> 02:10.520]  different methods to hunt on the dark web. Can the dark web hunting be automated?
[02:10.600 --> 02:15.840]  What's the pipeline or the architecture of hunting, automating the dark web hunting?
[02:16.420 --> 02:22.000]  Then we will discuss a little about the threat intelligence lifecycle, that is, how
[02:22.000 --> 02:29.180]  threat hunting on the dark web is analogous to the threat intelligence lifecycle and which of its steps
[02:29.180 --> 02:34.740]  correspond to it. We will discuss a little about operational security, that's OPSEC,
[02:34.740 --> 02:41.600]  and why is it important to secure yourself when hunting on the dark web. And yeah, that's it.
[02:43.320 --> 02:49.420]  So introduction to dark web. I'm sure you must have seen this image a lot of times on the internet
[02:49.420 --> 02:55.560]  that shows difference between surface web, deep web, and dark web. So the surface web is the
[02:56.220 --> 03:01.100]  sites that are indexed by different search engines, that's Google, Bing, Yahoo, etc.
[03:01.620 --> 03:08.760]  And the majority of the portion of the internet is deep web. Now deep web is any site, any website
[03:08.760 --> 03:15.700]  that is not indexed by the search engine. So this can include your databases, your server instances,
[03:15.700 --> 03:22.180]  or any other different websites that you cannot search from the search engines.
[03:22.860 --> 03:29.480]  The third one is the dark web that we will talk mostly about today. And this is the part of the
[03:29.480 --> 03:35.860]  internet where you need some kind of special software to access it. So this
[03:35.860 --> 03:43.300]  can include anything related to like drugs or weapons that are being sold on the dark web, or
[03:43.300 --> 03:52.860]  some kind of research, or books that are sold there. So the dark web includes several forums and
[03:52.860 --> 03:58.280]  marketplaces where people sell these different kinds of things.
[04:00.140 --> 04:06.280]  So the majority of the internet is deep web, and people often confuse the dark web with the deep
[04:06.280 --> 04:13.100]  web. So as you can see, only 6% of the internet is dark web, and 80 to 90% is the deep web.
[04:16.090 --> 04:22.170]  Moving on, so how do you access the dark web? So there are several companies, several organizations
[04:22.170 --> 04:29.730]  that offer their own dark web systems, or whatever you want to call them. The famous one is Tor,
[04:29.730 --> 04:37.090]  that's The Onion Router. The second is I2P, that's the Invisible Internet Project.
[04:37.090 --> 04:48.050]  ZeroNet is also becoming popular nowadays. So Tor has the .onion domain, that is, the domain name
[04:48.050 --> 04:55.590]  ends with .onion, and for I2P it is .i2p. Talking more about Tor, what Tor is and how it works:
[04:55.590 --> 05:07.070]  it's like a three-layer proxy system. If you can see on the image here, there are entry nodes,
[05:07.710 --> 05:12.470]  the entry node is where your traffic goes to, and then it goes through the middle layer,
[05:12.470 --> 05:19.450]  and the exit layer, and then goes to the destination. This way your identity is hidden,
[05:19.450 --> 05:26.370]  and also only the entry node is publicly listed, rest of the nodes are not publicly listed.
[05:26.830 --> 05:36.050]  So it hides your identity, it hides your IP address, and these nodes are volunteer-based
[05:36.050 --> 05:43.530]  systems. So Tor has about like 6,000 relays, and I don't know the number about I2P,
[05:43.530 --> 05:55.470]  but it is also becoming more popular nowadays. And the major thing about Tor is each node
[05:56.230 --> 06:04.370]  only knows the IP address of the next node or the previous node. So if we talk about here,
[06:04.370 --> 06:09.150]  the entry node doesn't know about the exit node or vice versa. So in this way,
[06:09.150 --> 06:18.170]  the location of the nodes is also protected. So there are many misconceptions about Tor and
[06:18.170 --> 06:25.870]  the dark web. The famous one is that whenever we talk about Tor, people think about criminals
[06:25.870 --> 06:32.110]  or the criminal things that go on on Tor. Yes, there is a criminal side of Tor,
[06:32.110 --> 06:39.150]  but there is a good side of Tor, or the dark web, also. So it's really famous among
[06:39.150 --> 06:47.670]  whistleblowers or activists. There are many countries where free speech is limited. So
[06:48.470 --> 06:53.550]  people from those countries can use Tor to express their speech
[06:53.550 --> 07:04.050]  or express what they think. Tor also gives access to a lot of old literature and research,
[07:04.050 --> 07:11.070]  which is not available on the open web. And it's like a safe haven for journalists, obviously.
[07:11.790 --> 07:16.410]  So there are many popular sites like Facebook and New York Times that have their counterpart
[07:16.410 --> 07:25.930]  on Tor, like their .onion website counterpart. So it's useful for whistleblowers or activists.
[07:26.290 --> 07:33.050]  The second thing is that many people think that it's illegal to access the dark web.
[07:33.350 --> 07:41.230]  So it's not illegal to access the dark web. It's illegal to indulge in these kinds of activities
[07:41.230 --> 07:46.950]  like purchasing drugs or purchasing any other illegal things on the dark web that
[07:46.950 --> 07:53.730]  is being sold there. But it's completely legal to access the dark web. And
[07:55.870 --> 08:01.810]  the last thing which I would talk about is that many people think Tor is really big.
[08:01.810 --> 08:07.150]  But if you talk about the uptime or the availability of sites, there are very few
[08:07.150 --> 08:13.690]  onion domains on the dark web if you compare it with the clear web. So
[08:14.110 --> 08:18.670]  the dark web is a very small part of the internet.
[08:22.040 --> 08:28.300]  So we will talk about the dark side of the dark web. I mean, the criminal side of the dark web.
[08:28.300 --> 08:33.960]  So there are many forums or marketplaces on the dark web. These are some of the
[08:33.960 --> 08:40.900]  sites that are relevant to security researchers or people who want to access the dark web for their
[08:41.740 --> 08:47.660]  organization's benefit. So some of them are like credit card markets, where
[08:47.660 --> 08:55.800]  different credit cards are being dumped, or remote access forums. So these forums include remote access
[08:55.800 --> 09:00.600]  trojans or some kind of remote access tools. Or insider threats: these are
[09:00.600 --> 09:08.460]  recently emerging forums where insiders, that is, the people who are selling
[09:08.460 --> 09:13.520]  their company's secrets, talk amongst themselves. So these are some of the relevant
[09:13.520 --> 09:22.420]  sites. Now coming to the cost. So how much does it cost to
[09:23.210 --> 09:27.980]  buy something from the dark web? As you can see, it's really easy to buy these things on the dark web.
[09:27.980 --> 09:35.820]  So like SSN, you can buy any SSN for $1, or fake FB friends, a fake FB with 15 friends,
[09:35.820 --> 09:44.440]  mobile malware, bank details, exploits, zero-days. So these types of things are
[09:44.440 --> 09:49.780]  easy to get on the dark web. That's why security researchers and other people are focusing more
[09:49.780 --> 09:55.400]  on the dark web. You might have heard about recent news, like 500,000 Zoom accounts sold
[09:55.400 --> 10:02.140]  on the dark web or 267 million FB user profiles sold on the dark web. So there are many
[10:02.700 --> 10:07.440]  data breaches occurring day by day and they are being sold on the dark web. That's why
[10:07.440 --> 10:15.080]  researching on the dark web is really important. These are some of the product listings
[10:15.860 --> 10:20.540]  from the forums of the dark web. This is how the products are being listed.
[10:23.420 --> 10:29.540]  And you can see these are the average cost of accounts for different online services like bank
[10:29.540 --> 10:34.860]  services, what's the average cost of it or what's the average cost of service for video games.
[10:35.940 --> 10:41.320]  This is the average cost of tools that are being sold on the dark web. As you can see again,
[10:41.320 --> 10:49.180]  the average cost for bank and financial tools is $74. So you can buy a brute forcing tool for some bank at $74.
[10:51.520 --> 10:56.880]  Now coming to why you should hunt on the dark web. Before that, let's talk about what is threat
[10:56.880 --> 11:05.700]  hunting. Threat hunting is proactive searching for cyber threats. Proactive means before the attack
[11:05.700 --> 11:11.820]  happens. That's proactive search. You search for cyber threats from logs or indicators of
[11:11.820 --> 11:18.500]  compromise. That's IP addresses, emails, domains, etc. or textual data. That's what we are doing
[11:18.500 --> 11:24.840]  when we are searching the dark web. So it's basically hypothesis based because there's
[11:24.840 --> 11:30.540]  nothing concrete about the process. You take one use case and work on it and then you take
[11:30.540 --> 11:37.600]  another use case and work on it. And it goes iteratively in the same way. Many times there's
[11:37.600 --> 11:43.160]  use of machine learning or natural language processing that's NLP and advanced analytics
[11:44.160 --> 11:50.560]  process in this because you need to scan through the textual data if you are hunting on the dark
[11:50.560 --> 11:56.120]  web or if you are hunting on the clear web also but for textual data. So machine learning and
[11:56.120 --> 12:05.420]  advanced analytics are useful there. So why is it serious? Why is threat hunting on the dark web
[12:05.420 --> 12:13.700]  really important? What's so big about that? As I told you before, there are many forums,
[12:13.700 --> 12:21.380]  marketplaces, dump shops. So what do criminals or actors do? They learn new methods and techniques
[12:21.380 --> 12:27.400]  on the dark web. They monetize their skills. They trade their exploits or tools or even drugs and
[12:27.400 --> 12:35.200]  weapons. And communication: they communicate with each other and they share their ideas for
[12:35.200 --> 12:43.220]  new attacks. A security researcher or a person who is researching on the dark web can
[12:43.460 --> 12:50.040]  learn a lot while engaging within these communities. You can learn their
[12:50.700 --> 12:57.900]  techniques or TTPs, that's the tactics, techniques and procedures, how they think about the attack,
[12:57.900 --> 13:04.660]  how they plan an attack. So if you do it correctly, you can identify attacks
[13:04.660 --> 13:12.040]  in the earlier stages, that's the planning and recon stages, and you can reduce the impact that it
[13:12.040 --> 13:18.600]  causes. So suppose your organization's data is being sold on the dark web. There are different
[13:18.600 --> 13:23.780]  kinds of impacts that this can cause to your organization. So some of the direct impacts
[13:23.780 --> 13:30.300]  are like personal information stolen or healthcare records stolen or even your company's trade
[13:30.300 --> 13:36.680]  secrets. And some of the indirect impacts are damage to the reputation of your organization, revenue loss,
[13:36.680 --> 13:41.900]  and nowadays the legal penalties, because your data is lost and you have to cover the costs
[13:41.900 --> 13:51.500]  of the customers. So that's why it is really important to research
[13:51.500 --> 14:00.660]  on the dark web, to hunt threats on the dark web. On the same lines, these are the benefits of the
[14:00.660 --> 14:06.240]  threat hunting. So if you do it correctly, you can keep up with the latest trends of the attacks,
[14:06.240 --> 14:13.380]  you can get new TTPs, that's tactics, techniques and procedures, you can identify insider threats,
[14:13.380 --> 14:18.360]  you can discover data breaches. The main thing is you can prepare your SOCs and incident responders
[14:18.360 --> 14:27.640]  to deal with the attack because they will know beforehand which TTPs attackers are using.
[14:27.640 --> 14:32.600]  So they can reduce the damage and risk to the organization by acting quickly on that.
[14:35.800 --> 14:41.440]  So coming to the methods to hunt on the dark web. So we will discuss some tools that are
[14:41.440 --> 14:45.760]  used to hunt on the dark web. And then we will discuss the human element that can be used
[14:45.760 --> 14:52.400]  to hunt on the dark web. So talking about the first tool that's really, really important for
[14:52.400 --> 15:00.700]  this is Scrapy. So it's a web crawling framework. It's so famous, it's so important because
[15:00.700 --> 15:06.260]  it manages multithreading automatically. So you don't have to spend too
[15:06.260 --> 15:11.440]  much time on the multithreading part, because it already has capabilities for multithreading
[15:11.960 --> 15:19.320]  using one or two lines of parameters. The second thing is Tor. Obviously,
[15:19.320 --> 15:26.120]  if you want to access the dark web, you need Tor. OnionScan is another tool that is used to search
[15:26.120 --> 15:33.760]  for onion websites. It can tell you if a website is up or not and the correlation between different
[15:33.760 --> 15:43.420]  websites on the dark web. Coming to Privoxy. So Privoxy is a web proxy. Before getting more into
[15:43.420 --> 15:51.620]  this: when you access the dark web, you need some kind of proxy to access the dark web because,
[15:51.620 --> 15:59.480]  as I told you before, the entry nodes are publicly listed. So your ISP can have a
[15:59.480 --> 16:06.720]  blacklist to block the entry nodes, so you can't access the dark web. Or even if the ISP doesn't
[16:06.720 --> 16:13.240]  block it, it can see whether you are accessing the dark web or not. It cannot see what you are
[16:13.240 --> 16:17.800]  doing on there, but it can see whether you are accessing the dark web or not. So you might not
[16:17.800 --> 16:24.060]  want that. That's why you need some kind of proxy, and the majority of people use a SOCKS proxy. So
[16:24.060 --> 16:30.900]  the basic difference between an HTTP and a SOCKS proxy is that a SOCKS proxy is a lower-level proxy and it
[16:30.900 --> 16:37.460]  works on the SOCKS protocol. An HTTP proxy only works on HTTP or HTTPS websites, but a SOCKS proxy can
[16:37.460 --> 16:47.880]  work on other protocols also. And there are different tools to use a SOCKS proxy, like
[16:47.880 --> 16:55.640]  tsocks, Polipo and Privoxy. I've been using Privoxy and I haven't had any
[16:55.640 --> 17:03.860]  problem with it. So it's good. Another thing is that Scrapy doesn't allow you to use
[17:05.060 --> 17:11.740]  a SOCKS proxy directly because it doesn't support SOCKS proxies. So that's why you have to use these
[17:11.740 --> 17:19.660]  tools like Privoxy, tsocks or Polipo to route your Scrapy traffic through the SOCKS proxy.
[17:20.380 --> 17:26.800]  And there are other tools also, like search engines such as Kilos or Recon, where you can find
[17:26.800 --> 17:35.540]  different onion domains. Apart from using a SOCKS proxy, you can also use a VPN with Tor for an
[17:35.540 --> 17:47.790]  extra layer of protection and for encrypting your data. So getting more into the Scrapy part,
[17:47.790 --> 17:55.050]  this image might seem a little confusing, but I will go through it step by step.
[17:55.050 --> 18:03.250]  So this is why Scrapy is really important and why Scrapy is so useful in hunting on the dark web.
[18:03.250 --> 18:11.590]  So if you can see there, for explaining this, I will explain it in terms of Python code. So
[18:11.590 --> 18:17.550]  suppose everything you see here is a different Python program. So spider is a Python program,
[18:17.550 --> 18:25.030]  downloader is a Python program, middleware is a Python program and so on. So what does the spider
[18:25.030 --> 18:32.550]  do? So in the spider Python program, you give the onion domain that you want to
[18:32.550 --> 18:40.270]  crawl or from which you want to get the data. So it gives it to the engine. Suppose the engine is
[18:40.270 --> 18:49.110]  just the program that manages all the other Python programs. So the spider gives the onion domain to the
[18:49.110 --> 18:54.370]  engine. The engine gives it to the scheduler. So here the multi-threading
[18:54.370 --> 18:59.990]  concept comes into the picture. The scheduler gets different domains and schedules them accordingly
[18:59.990 --> 19:06.450]  into multiple threads. So the onion domain goes into the scheduler, the scheduler gives it back
[19:06.450 --> 19:12.730]  to the engine and the engine gives it to the middleware. So the middleware program includes your
[19:12.730 --> 19:20.390]  proxy function and/or login function. So what is the proxy function? If we are
[19:20.390 --> 19:25.410]  talking about middleware, that's a Python program, and there's a proxy function in it.
[19:25.410 --> 19:32.870]  So the proxy function is where you will put down your Privoxy IP or Tor IP,
[19:32.870 --> 19:38.290]  so that the request and response go and come through that proxy so that you can access the
[19:38.290 --> 19:46.670]  dark web. The login function is where you will put your user agents or cookies.
[19:48.650 --> 19:57.050]  So for accessing the dark web forums nowadays, for all the forums,
[19:57.050 --> 20:01.690]  you have to have an account to access that particular forum.
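One way to sketch the login side is to keep an account's user agent and session cookies in one place and attach them to every request for that forum. `scrapy.Request` accepts `url`, `headers`, `cookies` and `meta` keyword arguments, so the helper below only builds those and does not need Scrapy to run. The account fields, cookie name, user agent string and onion address are made-up placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ForumAccount:
    """Session details for one forum account (all values are placeholders)."""
    forum_host: str
    user_agent: str
    cookies: dict = field(default_factory=dict)


def request_kwargs(url, account, proxy="http://127.0.0.1:8118"):
    """Build keyword arguments that could be passed to scrapy.Request(**kwargs)."""
    return {
        "url": url,
        "headers": {"User-Agent": account.user_agent},
        "cookies": dict(account.cookies),  # copy, so the account is not mutated later
        "meta": {"proxy": proxy},          # assumed local Privoxy address
    }


account = ForumAccount(
    forum_host="exampleforum.onion",  # hypothetical address
    user_agent="Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0",
    cookies={"session_id": "PASTE-COOKIE-FROM-LOGGED-IN-BROWSER"},
)
kwargs = request_kwargs("http://exampleforum.onion/index.php", account)
print(kwargs["headers"], kwargs["cookies"])
```

Keeping the cookies in a per-account object makes it easy to rotate between several accounts for the same forum.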
[20:01.690 --> 20:08.310]  So to access the forum, you need some kind of cookies. Also, there are many forums,
[20:08.310 --> 20:13.690]  many high-level forums, that implement CAPTCHAs. And as Google doesn't work on the dark web,
[20:13.690 --> 20:19.210]  these CAPTCHAs are image-based or text-based CAPTCHAs that are easy to bypass.
[20:19.290 --> 20:26.590]  So you can use any machine learning CAPTCHA-bypassing service or any CAPTCHA-bypassing
[20:26.590 --> 20:31.870]  websites like Death By Captcha or Anti-Captcha to bypass the CAPTCHA. So this is all code
[20:31.870 --> 20:39.870]  you will write in this login function in the middleware program. Now your request or your
[20:39.870 --> 20:45.150]  traffic goes through this middleware program to the downloader. The downloader is a
[20:45.150 --> 20:51.230]  simple program to extract the HTML and give it back to the engine. So the downloader extracts
[20:51.230 --> 20:58.430]  the HTML and gives it back to the engine. Now the engine gives it back to the spider. Now there's
[20:58.430 --> 21:06.050]  another function in the spider that extracts the HTML entities that you want from the forum's HTML,
[21:06.050 --> 21:15.330]  suppose the forum name or document ID or the text-based data or the name of the author who posted a
[21:15.330 --> 21:22.850]  particular content. So you get those, and they are called items in Scrapy. So the spider gets these items
[21:22.850 --> 21:29.390]  and sends them to the item pipeline. The item pipeline is where your database is configured.
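As a rough illustration of these last two steps, the sketch below pulls the author and post text out of a forum page and hands each item to a pipeline object. It uses only the standard library so that it runs on its own: `html.parser` stands in for Scrapy's selectors and SQLite stands in for whatever database you configure. The `post`, `author` and `content` class names are an invented page layout, and the pipeline class only mimics the `open_spider` / `process_item` / `close_spider` method names that Scrapy item pipelines use.

```python
import sqlite3
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect {'author', 'content'} items from an assumed, flat forum layout."""

    FIELDS = ("author", "content")

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # item currently being filled
        self._field = None    # field whose text we are inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "post" in classes:
            self._current = {}
        for name in self.FIELDS:
            if name in classes and self._current is not None:
                self._field = name

    def handle_data(self, data):
        # Assumes each field holds plain text with no nested tags.
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None
        if self._current is not None and all(f in self._current for f in self.FIELDS):
            self.items.append(self._current)
            self._current = None


class SqlitePipeline:
    """Mimics a Scrapy item pipeline's method names; SQLite is a stand-in database."""

    def __init__(self, path=":memory:"):
        self.path = path

    def open_spider(self, spider):
        self.db = sqlite3.connect(self.path)
        self.db.execute("CREATE TABLE IF NOT EXISTS posts (author TEXT, content TEXT)")

    def process_item(self, item, spider):
        self.db.execute(
            "INSERT INTO posts (author, content) VALUES (?, ?)",
            (item["author"], item["content"]),
        )
        self.db.commit()
        return item  # returned so that any later pipeline stage can use it

    def close_spider(self, spider):
        self.db.close()


PAGE = """
<div class="post"><span class="author">alice</span><p class="content">selling card dumps</p></div>
<div class="post"><span class="author">bob</span><p class="content">new RAT build</p></div>
"""

parser = PostParser()
parser.feed(PAGE)

pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
for item in parser.items:
    pipeline.process_item(item, spider=None)
rows = pipeline.db.execute("SELECT author, content FROM posts").fetchall()
pipeline.close_spider(spider=None)
print(rows)
```

Each forum will need its own field names and selectors, since the page structure differs from forum to forum.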
[21:29.390 --> 21:35.810]  So I use Elasticsearch, but you can use any database, whether SQL or NoSQL, and it directly saves the
[21:35.810 --> 21:45.470]  items to the database. And so the important thing to note here is that, as I told you before,
[21:45.470 --> 21:52.190]  multi-threading is automatically handled by Scrapy. Another thing is you don't have to give
[21:52.190 --> 21:58.850]  multiple onion domains to the spider. So when the downloader
[21:58.850 --> 22:07.930]  gets the HTML page from the particular forum, you don't have to configure much:
[22:07.930 --> 22:13.030]  there is code to get all the onion domains on that particular HTML page. So the scheduler
[22:13.030 --> 22:18.590]  automatically schedules the other onion domains to go through the same process again and again.
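The link-following step can be sketched without Scrapy as a function that finds onion addresses in a downloaded page and queues only the ones that pass a filter. The regular expression assumes v3 onion addresses (56 base32 characters), and the blocked host is a placeholder. In a real Scrapy project the same idea is usually expressed with `LinkExtractor(allow_domains=..., deny_domains=...)` rather than a hand-written regex.

```python
import re
from collections import deque

# v3 onion hostnames: 56 characters from the base32 alphabet (a-z, 2-7).
ONION_RE = re.compile(r"\b([a-z2-7]{56}\.onion)\b")


def discover_onions(html, seen, blocked):
    """Return onion hosts found in `html` that are neither already seen nor blocked."""
    found = []
    for host in ONION_RE.findall(html.lower()):
        if host in seen or host in blocked:
            continue
        seen.add(host)
        found.append(host)
    return found


# Placeholder addresses built only to have the right length.
good = "a" * 56 + ".onion"
bad = "b" * 56 + ".onion"
page = (
    f'<a href="http://{good}/forum">forum</a> '
    f'<a href="http://{bad}/">blocked</a> '
    f'<a href="http://{good}/faq">faq</a>'
)

seen = set()
queue = deque(discover_onions(page, seen, blocked={bad}))
print(list(queue))
```

The `seen` set prevents the same domain from being scheduled twice, and the block list keeps the crawler away from domains you have decided not to touch.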
[22:19.170 --> 22:26.470]  In this way, you don't have to give extra onion domains to the engine. And this is why Scrapy is
[22:27.070 --> 22:36.130]  really useful in crawling data from the dark web. So you can specify which domain to crawl
[22:36.130 --> 22:42.390]  and which domain to block. In this way, you can be safe from getting illegal data or getting
[22:42.390 --> 22:52.850]  illegal images. Moving on, now comes the human part. So we discussed the tools, what tools
[22:52.850 --> 22:59.210]  you can use to hunt on the dark web. There is a human element also that's called human intelligence
[22:59.210 --> 23:06.250]  or HUMINT. So it's a process of gathering intelligence through interpersonal contact
[23:06.250 --> 23:13.010]  rather than through some kind of tools or technical process. That's why it's the
[23:13.010 --> 23:19.770]  most dangerous and difficult form, because you are directly talking to the actor on the dark web,
[23:19.770 --> 23:28.470]  which is not safe. And it's not safe because you don't want your identity to be revealed
[23:28.470 --> 23:35.650]  to the actor or you don't want your organization's identity to be revealed to the actor. And it's
[23:35.650 --> 23:44.770]  important also because you can identify and respond to attacks much more quickly. You can do
[23:44.770 --> 23:51.210]  post-attack investigation. So suppose there's a data breach at
[23:51.210 --> 23:57.010]  your organization, and someone is selling this data on the dark web.
[23:57.010 --> 24:02.330]  And you want to confirm whether they are lying
[24:02.330 --> 24:08.430]  about it or whether it is the truth. So you can activate your human intelligence, that is,
[24:09.090 --> 24:16.530]  the person that is researching on the dark web, to go and ask the actors whether the data is
[24:16.530 --> 24:21.870]  correct or not. So that's post-attack investigation. You can also use it for
[24:21.870 --> 24:28.610]  new attack vector discovery. So that's discovering new TTPs that the attackers are using
[24:28.610 --> 24:35.830]  or the attackers are discussing. You can think of it as a high-tech equivalent of
[24:35.830 --> 24:43.870]  what an FBI agent does when he spends months or years working to infiltrate a criminal organization.
[24:43.870 --> 24:51.590]  That's why it's really hard to do, because you have to spend so much time on it. And that's why
[24:51.590 --> 25:00.170]  it's risky. And for this, you have to think like an actor: how they communicate within these
[25:00.170 --> 25:05.790]  communities, how they act within these communities. And the other thing is that
[25:05.790 --> 25:12.050]  the intelligence from this source is really valuable to your organization's safety.
[25:14.530 --> 25:20.250]  Now moving on, so we talked about tools, we talked about human intelligence, but now
[25:20.250 --> 25:26.350]  comes the pipeline or the architecture of how you can automate this threat hunting.
[25:26.350 --> 25:36.090]  So before that, I would suggest setting up a separate system. You don't want your personal
[25:36.090 --> 25:42.430]  data to be on the system where you are doing threat hunting. So you can set up any lab or VM,
[25:42.430 --> 25:47.690]  whether physical or whether on cloud, just isolate the network and install relevant tools like
[25:47.690 --> 25:56.450]  Scrapy, Privoxy, Tor, ELK if you are using Elasticsearch or Kibana, and the different Python libraries
[25:56.450 --> 26:03.630]  that would be necessary for your task. So this is the automated architecture
[26:04.170 --> 26:12.190]  that I have been using. I will go through it one by one. And I have this automate icon
[26:12.190 --> 26:19.550]  for the tasks that can be automated. The only tasks that don't have it, I
[26:19.550 --> 26:25.910]  think, are the Scrapy setup and the design-and-train NLP model step. So it's hard to automate those parts. So
[26:25.910 --> 26:33.170]  I will go through it. So first of all, you need to get the forum links or the
[26:33.170 --> 26:39.970]  market links. So you can write a simple script to gather data from different search engines, like
[26:39.970 --> 26:47.370]  I told you, Recon and other search engines, where you can get all the forum links.
[26:47.730 --> 26:53.270]  So that can be automated. So another thing is using a SOCKS proxy. Like I told you,
[26:53.270 --> 26:58.850]  you have to use some kind of SOCKS proxy for this. So you can get SOCKS proxy IPs. Again,
[26:58.850 --> 27:04.030]  you can write a simple script and get the SOCKS proxies, so that can be automated.
[27:04.790 --> 27:14.050]  Now comes the part of Scrapy setup. So in the Scrapy setup, you will write your
[27:14.990 --> 27:22.290]  login functions, you will set up your proxies, and you will
[27:23.410 --> 27:28.010]  manage the Scrapy settings. Now, you can't automate this because
[27:28.850 --> 27:33.250]  when you get the onion links or different forum onion links, you have to go to the
[27:33.250 --> 27:41.410]  forums and sign in. Yes, you can write scripts for logging in or signing in
[27:42.170 --> 27:49.290]  using different accounts. But I found it difficult to do this. That's why I
[27:49.290 --> 27:55.010]  have been doing it manually: creating different accounts,
[27:55.010 --> 28:02.370]  like four to five accounts per forum, and then noting them down in Scrapy, that is, the username,
[28:02.370 --> 28:08.750]  password and cookies. So you have to do this by hand. Also, you need different
[28:09.390 --> 28:14.010]  scrapers for different forums, because the architecture of forums is different for
[28:14.610 --> 28:22.870]  different forums. That's why you need to do this step manually, because you need to analyze,
[28:22.870 --> 28:30.010]  you need to first log into the forum, analyze it, and then write a different function for each
[28:30.010 --> 28:36.830]  forum for the HTML elements that you want to access. Coming to the crawler part. So
[28:37.650 --> 28:44.470]  the crawler, parser, analyzer and the ELK part are all part of the
[28:44.470 --> 28:50.970]  Scrapy system that I discussed before. I've just written
[28:50.970 --> 28:55.930]  them separately so you can understand what each part does. So what the crawler does is, again, get
[28:55.930 --> 29:03.090]  pages from the forums. The parser parses the HTML pages, getting HTML elements
[29:03.090 --> 29:11.690]  like post content, author, etc. And the analyzer: for the analyzer part,
[29:11.690 --> 29:18.430]  you can write a separate function in Scrapy. So what the analyzer does is this: suppose
[29:18.430 --> 29:25.230]  you got the data from the dark web forum. Now you need to use some kind of techniques to
[29:25.230 --> 29:30.770]  evaluate the content that is relevant to your organization or relevant to your threat model.
[29:30.770 --> 29:36.990]  So we will discuss what threat modeling is in the later part of the presentation. So for now,
[29:36.990 --> 29:43.630]  just understand this, you can't focus on every other threat that's out there, you have to focus
[29:43.630 --> 29:48.990]  on threat that is relevant to your organization. So you need to do some kind of analysis to
[29:50.390 --> 29:55.710]  get the relevant data from the dark web. So here comes the NLP model
[29:56.530 --> 30:04.650]  that I've been using. So you can design and train an NLP model in such a way that it can
[30:05.710 --> 30:09.770]  just get you the content that is relevant to your organization.
[30:10.630 --> 30:17.330]  So suppose you are a bank: you don't want to focus on tools or you don't want
[30:17.330 --> 30:23.510]  to focus on data breaches that's not relevant to your bank, you most likely would want to focus on
[30:23.510 --> 30:30.890]  the dump shops, that's where credit cards or debit cards are being dumped. So in this way,
[30:30.890 --> 30:35.150]  different organizations have different requirements, and you want to focus on those.
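Before any model is trained, the relevance idea can be approximated with a plain keyword filter derived from the threat model. This is a deliberately simple stand-in and not the NLP model described in the talk; the bank profile, keywords, weights and threshold below are invented for illustration.

```python
import re

# Hypothetical threat-model keywords for a bank; the weights are arbitrary.
BANK_KEYWORDS = {"dump": 2, "cvv": 2, "fullz": 2, "card": 1, "bank": 1}


def relevance(text, keywords):
    """Add up the keyword weight of every matching token in the post."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(keywords.get(tok, 0) for tok in tokens)


def filter_relevant(posts, keywords, threshold=2):
    """Keep only posts whose keyword score reaches the threshold."""
    return [p for p in posts if relevance(p["content"], keywords) >= threshold]


posts = [
    {"author": "alice", "content": "Fresh CVV dump, bank cards from EU"},
    {"author": "bob", "content": "Selling game accounts, cheap"},
]
relevant = filter_relevant(posts, BANK_KEYWORDS)
print([p["author"] for p in relevant])
```

A filter like this can also produce the first batch of labelled posts that a trained model would later need.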
[30:35.150 --> 30:41.510]  Now, designing and training an NLP model can't be automated because you need some kind of content
[30:41.510 --> 30:48.610]  relevant to your threat model beforehand. And then you need to train your NLP model on that.
[30:49.550 --> 30:56.110]  So it's somewhat like either you get the data first or you design
[30:56.110 --> 31:01.590]  your model first. So it's like a chicken-and-egg problem. But nowadays, there are many NLP models
[31:01.590 --> 31:12.410]  like seeded LDA, where you can provide some kind of context before training the NLP model.
[31:12.910 --> 31:19.970]  So it's easy to do that. And then you store the data into ELK. So all these things can be
[31:19.970 --> 31:32.670]  automated. So coming to the part after getting this data, so what's the process after hunting?
[31:32.670 --> 31:36.450]  Now we'll discuss a little about the Threat Intelligence Lifecycle.
[31:36.730 --> 31:41.230]  So the Threat Intelligence Lifecycle is the set of steps that your organization takes
[31:43.390 --> 31:52.110]  to build threat intelligence: it starts from getting the data and goes till the presentation of the data.
[31:52.850 --> 32:01.750]  So how threat hunting on the dark web corresponds to this is, so there are like five phases,
[32:01.750 --> 32:05.970]  as you can see, direction, collection, processing, analysis, and dissemination.
[32:06.690 --> 32:11.950]  What we are doing is, we are doing direction phase from the human sources, like you can see
[32:11.950 --> 32:18.130]  from dark web social media forums. So in the direction phase, we identify dark web forums,
[32:18.130 --> 32:24.930]  we register on those forums, we acquire access on those forums. In the collection phase,
[32:24.930 --> 32:32.590]  you use Scrapy to establish access and collect raw data. Processing phase is also using Scrapy,
[32:32.590 --> 32:38.890]  so you parse raw HTML data, you extract topics and authors. And the analysis phase is where
[32:38.890 --> 32:45.670]  we use NLP and other machine learning models to infer relationship between these data.
[32:46.030 --> 32:52.030]  We get data that is relevant to our organization. We link data sources, we identify trends and
[32:52.030 --> 32:57.410]  hacks and leaks, etc. And dissemination phase is where we visualize the data in dashboards.
[32:57.410 --> 33:04.610]  If you are using Kibana or other kind of dashboards, we give out alerts and reports
[33:04.610 --> 33:14.150]  for our higher managers or the other people to see in our organization. So this is the
[33:14.150 --> 33:20.290]  crux of how threat hunting on the dark web maps to the threat intelligence life cycle.
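The processing, analysis, and dissemination phases described above can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the pipeline from the talk: the `topic` and `author` CSS class names, the keyword filter standing in for the NLP step, and the `darkweb-posts` index name are all hypothetical. The last function only builds a newline-delimited JSON body in the shape the Elasticsearch bulk API accepts; it does not send anything.

```python
import json
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Processing phase: pull topic and author text out of raw forum HTML.

    Assumes each post marks its fields with class="topic" and class="author".
    """

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None
        self._current = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("topic", "author"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None
            if "topic" in self._current and "author" in self._current:
                self.posts.append(self._current)
                self._current = {}


def process(raw_html):
    """Parse raw HTML into a list of {'topic': ..., 'author': ...} dicts."""
    parser = PostParser()
    parser.feed(raw_html)
    return parser.posts


def analyze(posts, keywords):
    """Analysis phase (placeholder): keep posts whose topic mentions a keyword."""
    return [p for p in posts if any(k in p["topic"].lower() for k in keywords)]


def to_bulk_body(posts, index="darkweb-posts"):
    """Dissemination prep: build an NDJSON body with one action line per document."""
    lines = []
    for post in posts:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(post))
    return "".join(line + "\n" for line in lines)
```

In a Scrapy-based setup, the parsing would normally live in the spider's callbacks, and the keyword filter would be replaced by the trained model; the point here is only how the phases hand data to each other.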
[33:21.730 --> 33:25.190]  Now, threat modeling. As I told you, I was going to talk about this
[33:25.970 --> 33:34.550]  in the coming presentation. So what threat modeling is, it's like identifying your organization's
[33:37.310 --> 33:42.610]  critical assets and focusing on those critical assets.
[33:42.610 --> 33:50.010]  So it's like understanding threats and how you can mitigate it when it happens to your
[33:50.010 --> 33:57.310]  organization particularly. So you understand what attacker wants, what different critical
[33:57.310 --> 34:04.190]  assets you have in your organization, what are different types of actors that can
[34:04.190 --> 34:09.150]  target you, whether they would be activists or insiders or some kind of criminal groups
[34:09.150 --> 34:17.210]  and know their capability. So here you choose your target on the dark web.
[34:20.090 --> 34:25.150]  So if you are a bank, you focus on credit card markets. If you are some other organization,
[34:25.150 --> 34:29.810]  if you want, you focus on insider threat markets or you focus on general markets.
[34:30.430 --> 34:36.550]  So in this way, you choose your target on the dark web. You prioritize risk; you can use the
[34:36.550 --> 34:44.010]  Pyramid of Pain for that. So you prioritize risk and focus on IOCs that are relevant to your
[34:44.010 --> 34:53.390]  organization. Another thing is you don't just use one source to target, like there are many,
[34:53.390 --> 34:59.630]  many forums on the dark web. You don't just focus on one target, you focus on multiple targets.
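The Pyramid of Pain prioritization mentioned a moment ago can be sketched as a simple ranking: indicators that are harder for an attacker to change get looked at first. This is a minimal illustration; the short type labels and the sample indicators are assumptions, and the numeric levels only encode the order of the pyramid.

```python
# Pyramid of Pain levels, from easiest for an attacker to change (1) to hardest (6).
PYRAMID_OF_PAIN = {
    "hash": 1,
    "ip": 2,
    "domain": 3,
    "artifact": 4,  # network or host artifacts
    "tool": 5,
    "ttp": 6,
}


def prioritize(indicators):
    """Sort indicators so the ones that hurt the attacker most come first."""
    return sorted(
        indicators,
        key=lambda indicator: PYRAMID_OF_PAIN[indicator["type"]],
        reverse=True,
    )
```

Given a mixed list of collected indicators, `prioritize` would put a described technique ahead of a domain, and a domain ahead of a file hash.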
[34:59.630 --> 35:05.290]  Apart from dark web, you focus on multiple clear web sites also, like
[35:05.290 --> 35:13.350]  Pastebin or Twitter, or nowadays Telegram also, where many actors are communicating.
[35:13.350 --> 35:20.050]  So you focus on the dark web also and the clear web also to get all the things you can
[35:20.050 --> 35:28.050]  for protecting your organization. So again, data collection processing, you collect data from the
[35:28.050 --> 35:35.210]  clear web and the dark web. So some of the sites are pastebin, Twitter, Reddit. On the dark web,
[35:35.210 --> 35:40.510]  it's forums, different forums and different marketplaces. You can do all this using the
[35:40.510 --> 35:48.150]  Scrapy crawler and parser that we discussed before. The analysis part in the threat intelligence
[35:48.150 --> 35:55.350]  model is you use, like I told you before, you use NLP machine learning or deep learning techniques.
[35:56.090 --> 36:03.810]  Some of them are like LDA, BERT, GPT to gather information related to your organization.
[36:04.010 --> 36:12.190]  You use social network analysis for analysis of different users on the dark web that post
[36:13.010 --> 36:19.750]  data related to your organization. There's clustering of products according to categories
[36:20.530 --> 36:26.790]  for the clustering thing. You classify different. So there's like binary classification,
[36:26.790 --> 36:31.810]  multi-class classification. So that's how you classify different products being sold on the
[36:31.810 --> 36:39.770]  dark web. So these all things come under analysis. I will touch a little on MITRE ATT&CK framework.
[36:39.770 --> 36:49.730]  So what MITRE ATT&CK is, it's a knowledge base of all the TTPs that was built
[36:50.370 --> 36:55.570]  using real world observations. So it contains different tactics, techniques, and procedures
[36:55.570 --> 37:03.170]  that attackers have used all these years. So you use ATT&CK matrix to map the intelligence you
[37:03.170 --> 37:10.290]  obtained to understand the TTPs better and to protect your organization more.
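Mapping collected intelligence to ATT&CK can start as something as simple as a keyword lookup. The sketch below is only an illustration of that idea: the keyword list is an assumption, and a real mapping would be curated by an analyst. The technique IDs and names are taken from the ATT&CK Enterprise matrix.

```python
# Hypothetical keyword-to-technique lookup (IDs and names from ATT&CK Enterprise).
ATTACK_KEYWORDS = {
    "phishing": ("T1566", "Phishing"),
    "brute force": ("T1110", "Brute Force"),
    "ransomware": ("T1486", "Data Encrypted for Impact"),
    "stolen credentials": ("T1078", "Valid Accounts"),
}


def map_to_attack(text):
    """Return the sorted, de-duplicated techniques whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        {technique for keyword, technique in ATTACK_KEYWORDS.items() if keyword in lowered}
    )
```

A forum post advertising a phishing kit and a ransomware builder would map to T1486 and T1566, which gives defenders a concrete starting point for checking their detections.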
[37:13.500 --> 37:22.440]  So now coming to the operational security stuff. So hunting on the dark web or if you are doing
[37:22.440 --> 37:26.440]  human intelligence stuff on the dark web, you need to follow some set of
[37:28.700 --> 37:34.720]  processes so that you don't reveal your data or reveal your identity or your
[37:34.720 --> 37:43.320]  organizational identity as I talked before. So what is OPSEC? So OPSEC is the practice of
[37:43.320 --> 37:52.560]  hiding yourself online so that you don't reveal your real self or you don't reveal or compromise
[37:52.560 --> 37:57.740]  your own operations. It's derived from the US military, that's operational security.
[37:58.440 --> 38:10.680]  You need to hide your PII, that's personally identifiable information. So you need to work
[38:10.680 --> 38:17.180]  on the dark web in such a way that you don't disclose your full name or driver's license or
[38:17.180 --> 38:24.720]  bank account or even a simple thing as email. This is what you need to protect and that's why
[38:24.720 --> 38:32.780]  security is really important. And it's also a hard thing to do because at the end of the day,
[38:32.780 --> 38:38.960]  we are all humans and we like to be seen as knowledgeable and we like to impress others.
[38:39.020 --> 38:44.800]  This all leads to gossiping, bragging and oversharing with others.
[38:44.800 --> 38:51.060]  That's why operational security is really hard. And most of the time people think of it as a
[38:51.060 --> 38:56.560]  process. So people think that, okay, I have to do human intelligence stuff. Now I have to follow
[38:56.560 --> 39:04.480]  operational security. It should not be like that. It should not be seen as a burden to perform or as
[39:04.480 --> 39:11.960]  another of your job tasks to perform. It should be a mindset like you should always think about
[39:11.960 --> 39:16.880]  operational security before doing human intelligence or before engaging with actors.
[39:17.820 --> 39:24.400]  So I will discuss some of the things that you can use to maintain
[39:24.400 --> 39:30.980]  opsec in your daily lifestyle. There are many other things. So the main thing you want is
[39:30.980 --> 39:38.240]  hiding your identity. So the first thing you can do or you should do is use a separate system
[39:38.240 --> 39:43.880]  like I talked about before. Also use a separate system where you don't store any personal
[39:43.880 --> 39:55.160]  information. Whether it be a lab or VM or some kind of system. The main thing is to use Tor
[39:55.960 --> 40:08.280]  with a proxy or Tor over VPN. The main thing is maintaining different personas on the dark web.
[40:08.280 --> 40:15.460]  It's like I told, it's an equivalent of an FBI agent going undercover. So he has some kind of
[40:15.460 --> 40:20.560]  persona. He has a backstory. You have to do that. You have to do the same thing. You should have
[40:20.560 --> 40:28.100]  different personas for different identities that you have on different forums. You should never
[40:28.100 --> 40:36.380]  mix it up. That's why you take extensive notes so that you don't mess up the personas.
[40:38.740 --> 40:45.240]  You should always watch what you say and you should always think before posting.
[40:47.240 --> 40:55.000]  Human intelligence is not a 9-to-5 job thing. You can't just talk to or you can't just communicate
[40:55.000 --> 41:07.200]  to actors during your job time because they will know that you are doing this as your job.
[41:07.200 --> 41:16.260]  In this way, you can be exposed. They can easily guess that you are a researcher and not a threat
[41:16.260 --> 41:28.400]  actor. It would be a tip off for them. That's why you have to do this work 24 by 7. It's not like
[41:28.400 --> 41:34.720]  you have to do this work on the weekends. You have to do this work after your work hours.
[41:34.720 --> 41:43.660]  That's why it's not a 9-to-5 job thing. You have to develop appropriate language skills because
[41:43.660 --> 41:50.000]  people don't talk formally like actors don't talk formally. You have to develop appropriate
[41:50.000 --> 41:56.560]  language skills or slang skills. Also, there are many forums like there are different Russian
[41:56.560 --> 42:01.380]  forums or German forums. You might need to develop that language skills like learning
[42:01.380 --> 42:10.640]  Russian, learning German. Another thing to note is changing time zones. If you are in US and you
[42:10.640 --> 42:16.740]  are accessing or you are engaging in a community on a Russian forum, you might want to change the
[42:16.740 --> 42:24.280]  time zone to Russia because again, it would be a tip off to the actors that you are a security
[42:24.280 --> 42:30.380]  researcher or that you are not a real actor. These are some of the things that should be noted
[42:30.380 --> 42:33.900]  before doing any intelligence stuff on the dark web.
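The time zone and working-hours points can be combined into a small sanity check before posting: convert the current UTC time into the persona's claimed local time and see whether activity at that hour looks plausible. This is a minimal sketch, not a tool from the talk. It assumes a persona claiming to be in Moscow, which is at a fixed UTC+3 offset, and the waking-hours window is an arbitrary assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the persona claims to live in Moscow (fixed UTC+3, no daylight saving).
PERSONA_TZ = timezone(timedelta(hours=3), "MSK")


def persona_local_time(utc_dt, persona_tz=PERSONA_TZ):
    """Convert a timezone-aware UTC datetime to the persona's local time."""
    return utc_dt.astimezone(persona_tz)


def is_plausible_posting_time(utc_dt, start_hour=8, end_hour=23, persona_tz=PERSONA_TZ):
    """Return True if the persona would plausibly be awake and online at this moment."""
    local = persona_local_time(utc_dt, persona_tz)
    return start_hour <= local.hour < end_hour
```

For instance, 14:00 UTC is 17:00 for this persona and passes the check, while 23:30 UTC is 02:30 the next day and does not. This only covers the timing of activity; changing the time zone that the system or browser reports is a separate step.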
[42:35.760 --> 42:42.400]  Now, that was it. So concluding all this, we discussed a little about the dark web, what dark
[42:42.400 --> 42:48.640]  web is, how to access dark web. We discussed about dark web forums and marketplaces, what
[42:49.400 --> 42:53.960]  different products are being sold there, what is the cost model around that.
[42:53.960 --> 42:58.340]  We discussed about threat hunting on the dark web, how you can hunt on the dark web.
[42:58.340 --> 43:05.360]  We discussed different tools and the main tool was Scrapy. That's the main framework that we
[43:05.360 --> 43:12.700]  are working on to hunt on the dark web. We discussed about human intelligence, how it can be
[43:12.700 --> 43:21.980]  used or how it should be used to support your tool-based data collection. We discussed about
[43:21.980 --> 43:25.700]  the pipeline or the architecture that can be used to automate the dark web hunting.
[43:26.300 --> 43:32.320]  We talked a little about a threat intelligence life cycle, how threat hunting on the dark web
[43:32.320 --> 43:38.360]  maps to the life cycle steps. And we talked about operational security and why it is so
[43:38.360 --> 43:49.900]  important and why it is so hard to do operational security. Again, some more points to note:
[43:49.900 --> 43:55.200]  obviously, threat hunting on the dark web is hard, but it's worth the effort. You don't get
[43:55.200 --> 44:01.480]  elsewhere the intelligence that you get from the dark web. You should always keep operational security in mind.
[44:02.640 --> 44:08.340]  Like I said before, you should look on more than one resource. You should look on forums. You
[44:08.340 --> 44:15.780]  could look on clear web forums also like Pastebin and Telegram as an example. And you should look
[44:15.780 --> 44:21.840]  on different other forums on the dark web also. It takes a lot of resources and a team effort. You
[44:21.840 --> 44:28.400]  can't do all these things alone. You should have a team for this. And we talked a little about
[44:28.400 --> 44:38.220]  usage of the MITRE ATT&CK framework and how it can be useful to map your TTPs to that.
[44:39.880 --> 44:46.320]  These are some of the resources that I suggest you to read if you want to know more about the
[44:46.320 --> 44:52.840]  dark web stuff or dark web hunting stuff. The major ones are like Recorded
[44:52.840 --> 44:59.920]  Future's Insikt Group, CrowdStrike, Digital Shadows. They release their blogs or white papers regularly.
[44:59.920 --> 45:07.140]  So read them and you'll understand what all things are there on the dark web.
[45:10.100 --> 45:17.520]  So yeah, that was it. I think, thank you so much. I hope you all liked my presentation and
[45:17.520 --> 45:23.080]  you can contact me on Twitter or LinkedIn if you have any doubt or if you want to discuss
[45:23.080 --> 45:30.920]  about this stuff more. I will hang around on Discord to answer all your questions. So yeah, thank you.
